{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Estimating Cargo\n",
    "This example explores how you can use Vektorius to perform more comlex analysis to estimate soybean commodity flows out of Brazil.  \n",
    "\n",
    "In particular we look at\n",
    "\n",
    "  1. using Vektorius to perform data preparation and filtering\n",
    "  2. joining `vessels` and `voyages` data to estimate cargo\n",
    "  3. comparing our results to published statistics¶"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.simplefilter('ignore')\n",
    "\n",
    "import datetime\n",
    "import json\n",
    "import os\n",
    "\n",
    "from functools import wraps\n",
    "\n",
    "import ibis\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from shapely.geometry import MultiPolygon, shape\n",
    "from descarteslabs.vektorius import vector\n",
    "import displacement\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# helper functions\n",
    "def str_to_dt(s):\n",
    "    year, month = s.split(\"_\")\n",
    "    return datetime.datetime(int(year), int(month), 1)\n",
    "\n",
    "def displacement_helper(dwt, draft, length, width, return_val):\n",
    "    klass, Cb, Td, disp = displacement.get_displacement(dwt, draft, length, width)\n",
    "    if return_val == \"disp\":\n",
    "        return disp\n",
    "    elif return_val == \"klass\":\n",
    "        return klass\n",
    "\n",
    "def cached_berth(f):\n",
    "    memo = dict()\n",
    "    @wraps(f)\n",
    "    def wrapped(point, shape_dict):\n",
    "        hashable = point.wkt\n",
    "        if hashable in memo:\n",
    "            return memo[hashable]\n",
    "        result = f(point, shape_dict)\n",
    "        memo[hashable] = result\n",
    "        return result\n",
    "    return wrapped\n",
    "\n",
    "@cached_berth\n",
    "def get_berth_from_point(point, shape_dict):\n",
    "    for aoi_name, aoi_poly in shape_dict.items():\n",
    "        if point.within(aoi_poly):\n",
    "            return aoi_name\n",
    "    return np.nan\n",
    "\n",
    "def get_port_from_berth(s):\n",
    "    return s.split(\"_\")[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Construct Vessels Query\n",
    "We're interested in enriching each voyage with information about the ship making the voyage.  To that end we need to locate the Cargo ships that make voyages, and ensure that we have information about the capacity and dimensions of each ship.  Unfortunately, not all vessels have the correct deadweight tonnage (`dwt`) information available, so we want to create some estimated dwt values for vessels that are missing that information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Filter vessels for the right kind of ship"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Constants for queries:\n",
    "ship_type = \"Cargo\"\n",
    "# used for similar-size calculation\n",
    "length_thresh = 15\n",
    "width_thresh = 5\n",
    "\n",
    "# find all cargo ships\n",
    "vessels_table = vector.table(\"vessels\")\n",
    "vessels_query = vessels_table[\n",
    "    vessels_table.mmsi,\n",
    "    vessels_table.length,\n",
    "    vessels_table.width,\n",
    "    vessels_table.ship_type,\n",
    "    # capacity can be in either or both, prefer capacity.dwt if present\n",
    "    vessels_table.capacity.dwt.coalesce(vessels_table.derived_dwt).name(\"dwt\")\n",
    "]\n",
    "vessels_query = vessels_query.filter((\n",
    "    vessels_query.ship_type == ship_type) &\n",
    "    # length and width must be present to estimate displacement\n",
    "    vessels_query.length.notnull() &\n",
    "    vessels_query.width.notnull()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Estimate dwt for ships missing the value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# some cargo ships don't have a listed capacity (`dwt`), so we\n",
    "# need to estimate this value using other cargo ships that \n",
    "# have similar dimensions AND have a listed capacity.\n",
    "# we'll use a self join on the vessels table to find these other\n",
    "# ships and calculate a mean value for each ship missing capacity\n",
    "similar_vessels_table = vessels_table.view()\n",
    "similar_vessels = similar_vessels_table[\n",
    "    similar_vessels_table.length,\n",
    "    similar_vessels_table.width,\n",
    "    similar_vessels_table.ship_type,\n",
    "    # capacity can be in either or both, prefer capacity.dwt if present\n",
    "    similar_vessels_table.capacity.dwt.coalesce(similar_vessels_table.derived_dwt).name(\"dwt\")\n",
    "]\n",
    "# other ships must have dimensions and capacity to be valid for this calculation.\n",
    "# Making an assumption that similar sized Passenger Ships are not valid comparisons\n",
    "# for dwt, since they're designed for different \"cargo\"\n",
    "similar_vessels = similar_vessels.filter(similar_vessels.dwt.notnull() &\n",
    "                                         similar_vessels.length.notnull() & \n",
    "                                         similar_vessels.width.notnull() &\n",
    "                                         (similar_vessels.ship_type == ship_type))\n",
    "\n",
    "# join the target list of vessels to the list of similar vessels\n",
    "# and exclude target vessels that already have a capacity, we don't\n",
    "# need to bother estimating their capacity\n",
    "vessels_missing_dwt = vessels_query.filter(vessels_query.dwt.isnull())\n",
    "joined = vessels_missing_dwt.cross_join(similar_vessels)\n",
    "joined = joined[\n",
    "    vessels_missing_dwt.mmsi,\n",
    "    vessels_missing_dwt.length,\n",
    "    vessels_missing_dwt.width,\n",
    "    vessels_missing_dwt.ship_type,\n",
    "    similar_vessels.dwt,\n",
    "    similar_vessels.length.name(\"similar_length\"),\n",
    "    similar_vessels.width.name(\"similar_width\")\n",
    "]\n",
    "joined.filter(\n",
    "    # length and width of similar ships is +/- some threshhold value\n",
    "    (joined.similar_length.between(joined.length - length_thresh, joined.length + length_thresh)) &\n",
    "    (joined.similar_width.between(joined.width - width_thresh, joined.width + width_thresh))\n",
    ")\n",
    "    \n",
    "# calculate mean value of similar ships for each ship with unknown capacity\n",
    "estimated_dwt_vessels = joined.group_by([\n",
    "    joined.mmsi,\n",
    "    joined.length,\n",
    "    joined.width,\n",
    "    joined.ship_type,\n",
    "]).aggregate(dwt=joined.dwt.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Create the final vessels query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create the final list of vessels with listed and estimated capacity\n",
    "final_vessels = vessels_query.left_join(estimated_dwt_vessels,\n",
    "                                        vessels_query.mmsi == estimated_dwt_vessels.mmsi)\n",
    "final_vessels = final_vessels[\n",
    "    vessels_query.mmsi,\n",
    "    vessels_query.length,\n",
    "    vessels_query.width,\n",
    "    vessels_query.ship_type,\n",
    "    vessels_query.dwt.coalesce(estimated_dwt_vessels.dwt).name(\"dwt\"),\n",
    "    # allows us to determine if the dwt was estimated or listed\n",
    "    estimated_dwt_vessels.mmsi.isnull().name(\"estimated\")\n",
    "]\n",
    "# if there still isn't a listed capcity, then remove those ships\n",
    "final_vessels = final_vessels.filter(final_vessels.dwt.notnull() &\n",
    "                                     (final_vessels.dwt != 0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Construct Voyages Query\n",
    "Now that we've identified potential cargo ships, we can find the voyages those ships have made and add the vessel info to the voyage info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load brazilian grain berth shapes and make a multipolygon to query from voyages\n",
    "workdir = os.getcwd()\n",
    "fname = os.path.join(workdir, \"brazil_grain-berths.geojson\")\n",
    "\n",
    "# berth names follow the convention {port}_{berth}\n",
    "with open(fname) as f:\n",
    "    feature_collection = json.load(f)\n",
    "\n",
    "polygons = [shape(f[\"geometry\"]) for f in feature_collection[\"features\"]]\n",
    "aoi = MultiPolygon(polygons)\n",
    "\n",
    "voyages_table = vector.table(\"voyages\")\n",
    "\n",
    "departure_filter = voyages_table.departure.between(\"2017-04-01\", \"2020-03-31\")\n",
    "aoi_filter = voyages_table.origin.within(aoi)\n",
    "voyages_query = voyages_table.filter(departure_filter & aoi_filter)\n",
    "\n",
    "final_voyages = final_vessels.inner_join(voyages_query, final_vessels.mmsi == voyages_query.mmsi)\n",
    "final_voyages = final_voyages[\n",
    "    final_vessels.mmsi,\n",
    "    final_vessels.length,\n",
    "    final_vessels.width,\n",
    "    final_vessels.dwt,\n",
    "    final_vessels.ship_type,\n",
    "    voyages_query.departure,\n",
    "    voyages_query.arrival,\n",
    "    voyages_query.origin,\n",
    "    voyages_query.destination,\n",
    "    voyages_query.avg_draft,\n",
    "    voyages_query.avg_speed,\n",
    "    voyages_query.est_cargo,\n",
    "    voyages_query.departure.month().name(\"month\"),\n",
    "    voyages_query.departure.year().name(\"year\")\n",
    "]\n",
    "\n",
    "vdf = final_voyages.execute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we finally have a DataFrame, we can perform some extra data prep that's easier to do with Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fill important missing values with average of valid ones\n",
    "vdf.loc[vdf[\"avg_draft\"].isin([np.nan, np.inf, -np.inf]), \"avg_draft\"] = \\\n",
    "    vdf.loc[~vdf[\"avg_draft\"].isin([np.nan, np.inf, -np.inf]), \"avg_draft\"].mean()\n",
    "\n",
    "vdf.loc[vdf[\"avg_speed\"].isin([np.nan, np.inf, -np.inf]), \"avg_speed\"] = \\\n",
    "    vdf.loc[~vdf[\"avg_speed\"].isin([np.nan, np.inf, -np.inf]), \"avg_speed\"].mean()\n",
    "\n",
    "assert(vdf.loc[vdf[\"avg_draft\"].isin([np.nan, np.inf, -np.inf]), \"avg_draft\"].shape[0] == 0)\n",
    "assert(vdf.loc[vdf[\"avg_speed\"].isin([np.nan, np.inf, -np.inf]), \"avg_speed\"].shape[0] == 0)\n",
    "\n",
    "# extra fields for analysis\n",
    "vdf[\"year_month\"] = vdf[\"year\"].astype(str) + \"_\" + vdf[\"month\"].astype(str)\n",
    "\n",
    "assert(vdf.loc[vdf[\"dwt\"].isin([np.nan, np.inf, -np.inf]), \"dwt\"].shape[0] == 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Calculate cargo\n",
    "Finally we've collected all the information we need to estimate cargo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# calculate displacement and vessel class\n",
    "vdf[\"est_cargo\"] = vdf.apply(lambda x: displacement_helper(x.dwt, x.avg_draft, x.length, x.width, \"disp\"), axis=1)\n",
    "vdf[\"class\"] = vdf.apply(lambda x: displacement_helper(x.dwt, x.avg_draft, x.length, x.width, \"klass\"), axis=1)\n",
    "\n",
    "# Drop negative displacements (still under investigation)\n",
    "print(vdf.shape)\n",
    "vdf = vdf.loc[vdf[\"est_cargo\"] >= 0]\n",
    "print(vdf.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# take a peek at the results\n",
    "vdf.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compare to Brazilian Soybean Exports by Port\n",
    "Now we can compare our estimated value to the actual statistics published at https://anec.com.br/pt-br/servicos/estatisticas/category/2020-7. To make things a little easier we've put the statistics into an easy to read CSV."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load a prepared version of the published version of the statistics\n",
    "fname = os.path.join(workdir, \"anec_soybean_port_exports_201701-202006.csv\")\n",
    "apdf = pd.read_csv(fname)\n",
    "\n",
    "# add a timestamp corresponding to the actual month data\n",
    "apdf[\"date\"] = (apdf[\"year\"].astype(str) + \"_\" + apdf[\"month\"].astype(str)).apply(str_to_dt)\n",
    "apdf = apdf[[\"date\", \"volume\", \"port\"]]\n",
    "\n",
    "apdf.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tag origin port in existing voyages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vdf = vdf.copy()\n",
    "\n",
    "# first make a dictionary of shapely shapes for each berth\n",
    "aoi_polys = {feat[\"properties\"][\"name\"]: shape(feat[\"geometry\"])\n",
    "             for feat in feature_collection[\"features\"]}\n",
    "\n",
    "# identify berth of origin and identify port as well - # berth names follow the convention {port}_{berth}\n",
    "vdf[\"berth\"] = vdf.apply(lambda x: get_berth_from_point(x.origin, aoi_polys), axis=1)\n",
    "vdf[\"port\"] = vdf[\"berth\"].apply(get_port_from_berth)\n",
    "\n",
    "assert((vdf.loc[vdf[\"berth\"].isna()].shape[0] == 0) & (vdf.loc[vdf[\"port\"].isna()].shape[0] == 0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compare for one port"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Look at Santos port as an example\n",
    "uport = \"santos\"\n",
    "\n",
    "# isolate the export data for this port\n",
    "apdf_tmp = apdf[apdf[\"port\"] == uport].copy()\n",
    "apdf_tmp= apdf_tmp.set_index(\"date\")\n",
    "\n",
    "# isolate the voyage data for this port and group by month\n",
    "vdf_tmp = vdf[vdf[\"port\"] == uport].copy().groupby(\"year_month\").sum()[[\"est_cargo\"]].reset_index()\n",
    "vdf_tmp[\"date\"] = vdf_tmp[\"year_month\"].apply(str_to_dt)\n",
    "vdf_tmp = vdf_tmp.sort_values(by=\"date\").drop(\"year_month\", axis=1).set_index(\"date\")\n",
    "\n",
    "# merge the two\n",
    "merge_tmp = vdf_tmp.merge(apdf_tmp, how='outer', left_index=True, right_index=True).dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig,ax = plt.subplots(1,2,figsize=(15,5))\n",
    "\n",
    "# time series comparison\n",
    "ax[0].plot(merge_tmp.index, merge_tmp[\"est_cargo\"] / 1e3, '-o', label=\"voyages_cargo\")\n",
    "ax[0].plot(merge_tmp.index, merge_tmp[\"volume\"] / 1e3, '-o', label=\"anec_volume\")\n",
    "ax[0].set_title(uport, fontsize=15)\n",
    "ax[0].grid(True)\n",
    "ax[0].legend(loc=\"upper left\")\n",
    "ax[0].set_ylabel(\"Thousand Tons\")\n",
    "\n",
    "# scatter plot comparison\n",
    "ax[1].scatter(merge_tmp[\"est_cargo\"].values, merge_tmp[\"volume\"].values)\n",
    "ax[1].grid(True)\n",
    "ax[1].set_ylabel(\"anec volume (1000 tons)\")\n",
    "ax[1].set_xlabel(\"voyages displacement (1000 tons)\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Filter voyages ending in China\n",
    "There is some correlation present, but not enough to draw a meaningful relationship. More than 80% of Brazil's soybean exports go to China, filtering for voyages ending in China should tighten up the signal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "china_coast_fc = {\n",
    "  \"type\": \"FeatureCollection\",\n",
    "  \"features\": [\n",
    "    {\n",
    "      \"type\": \"Feature\",\n",
    "      \"properties\": {\"name\": \"china_coast\"},\n",
    "      \"geometry\": {\n",
    "        \"type\": \"Polygon\",\n",
    "        \"coordinates\": [\n",
    "          [[120.498046875, 41.77131167976407],[117.42187500000001, 39.842286020743394],[116.3671875, 37.579412513438385],[118.38867187500001, 35.10193405724606],[118.38867187500001, 31.952162238024975],[117.158203125, 29.53522956294847],[113.90625, 26.980828590472107],[110.91796875, 25.403584973186703],[107.314453125, 23.725011735951796],[106.61132812499999, 23.079731762449878],[108.06152343749999, 21.49396356306447],[109.423828125, 20.838277806058933],[111.796875, 20.715015145512087],[114.697265625, 21.3303150734318],[117.50976562499999, 22.59372606392931],[119.17968749999999, 24.407137917727667],[121.5087890625, 27.176469131898898],[122.6953125, 29.99300228455108],[123.26660156249999, 31.466153715024294],[121.904296875, 33.90689555128866],[121.06933593749999, 35.639441068973944],[123.31054687499999, 36.70365959719456],[124.1455078125, 38.09998264736481],[124.23339843749999, 40.07807142745009],[122.607421875, 41.244772343082076],[120.498046875, 41.77131167976407]\n",
    "          ]]\n",
    "      }\n",
    "    }\n",
    "  ]\n",
    "}\n",
    "\n",
    "# make a dict of shapely shapes\n",
    "china_coast_polys = {feat[\"properties\"][\"name\"]: shape(feat[\"geometry\"])\n",
    "                     for feat in china_coast_fc[\"features\"]}\n",
    "\n",
    "# identify voyages that ended in China and drop all others\n",
    "vdf[\"china_dest\"] = vdf.apply(lambda x: get_berth_from_point(x.destination, china_coast_polys), axis=1)\n",
    "\n",
    "print(vdf.shape)\n",
    "vdf = vdf[~vdf[\"china_dest\"].isna()]\n",
    "print(vdf.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# re-examine Santos port\n",
    "uport = \"santos\"\n",
    "\n",
    "# isolate the export data for this port\n",
    "apdf_tmp = apdf[apdf[\"port\"] == uport].copy()\n",
    "apdf_tmp= apdf_tmp.set_index(\"date\")\n",
    "\n",
    "# isolate the voyage data for this port and group by month\n",
    "vdf_tmp = vdf[vdf[\"port\"] == uport].copy().groupby(\"year_month\").sum()[[\"est_cargo\"]].reset_index()\n",
    "vdf_tmp[\"date\"] = vdf_tmp[\"year_month\"].apply(str_to_dt)\n",
    "vdf_tmp = vdf_tmp.sort_values(by=\"date\").drop(\"year_month\", axis=1).set_index(\"date\")\n",
    "\n",
    "# merge the two\n",
    "merge_tmp = vdf_tmp.merge(apdf_tmp, how='outer', left_index=True, right_index=True).dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig,ax = plt.subplots(1,2,figsize=(15,5))\n",
    "\n",
    "# time series comparison\n",
    "ax[0].plot(merge_tmp.index, merge_tmp[\"est_cargo\"] / 1e3, '-o', label=\"voyages_cargo\")\n",
    "ax[0].plot(merge_tmp.index, merge_tmp[\"volume\"] / 1e3, '-o', label=\"anec_volume\")\n",
    "ax[0].set_title(uport, fontsize=15)\n",
    "ax[0].grid(True)\n",
    "ax[0].legend(loc=\"upper left\")\n",
    "ax[0].set_ylabel(\"Thousand Tons\")\n",
    "\n",
    "# scatter plot comparison\n",
    "ax[1].scatter(merge_tmp[\"est_cargo\"].values, merge_tmp[\"volume\"].values)\n",
    "ax[1].grid(True)\n",
    "ax[1].set_ylabel(\"anec volume (1000 tons)\")\n",
    "ax[1].set_xlabel(\"voyages displacement (1000 tons)\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The estimated volume magnitude is off, signaling some improvements needed in cargo calculation, but the correlation is much tighter"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
